

Reviews: Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems

Neural Information Processing Systems

The paper attempts to move away from traditional evaluation of open-domain dialog systems (i.e., judging a response given its conversation history) and towards a more interactive one (i.e., a human talking to a bot), which is likely an important step towards better evaluation. However, I do have several serious concerns about this work in its current form: (1) The authors contrast their work with existing evaluation for open-domain dialog, which they call "single-turn" evaluation. They point out that this type of evaluation fails to capture "failure modes […] such as a lack of diversity in the responses, inability to track long-term aspects of the conversation". I think this is rather misleading and the term "single-turn" is a misnomer. Most previous work has indeed evaluated each conversation by factorizing it into a sequence of independent turn-level judgments, but each of these judgments assesses the quality of the current turn T_n **given** a history of several previous turns …, T_{n-k}, …, T_{n-1}.


Reviews: Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems

Neural Information Processing Systems

This paper explores interesting directions, in particular 1) using interactive settings to evaluate a model rather than a single answer, and 2) combining different automated metrics (e.g., based on sentiment) into a weighted sum to approximate human evaluation. Reviewers have raised crucial points regarding gameability (so using the metrics for training a model is tricky if not followed by a non-gameable evaluation) and lack of comparability between different self-play runs. It is indeed a much better evaluation setting if the system does not control both sides (e.g., models being matched against the same set of fixed models), so the authors should definitely pursue that direction. However, I expect this work would still be interesting to the dialog community: many of the diagnostic advantages of the model-talking-to-model setting remain in practice, especially because the model is not trained with the self-play objective; the criterion is only used post hoc, so the system cannot extensively exploit it during training. In practice, many of the problems in a given model's generations already show up during self-play, and the reasonable worry raised by reviewers that the model could exploit the metric remains theoretical at the moment.
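The weighted combination the meta-review describes can be sketched in a few lines. The proxy names, values, and weights below are purely illustrative assumptions, not the paper's actual metrics or fitted coefficients:

```python
# Hedged sketch: combine per-conversation automated proxy metrics
# (each assumed to be scaled to [0, 1]) into a single weighted score.
# The specific proxies and weights are made up for illustration.

def combined_score(proxies, weights):
    """Weighted sum of proxy metric scores for one conversation."""
    return sum(weights[name] * value for name, value in proxies.items())

proxies = {"sentiment": 0.8, "coherence": 0.6, "diversity": 0.5}
weights = {"sentiment": 0.5, "coherence": 0.3, "diversity": 0.2}
print(round(combined_score(proxies, weights), 2))  # 0.68
```

In practice such weights would be fit against human judgments rather than chosen by hand, which is exactly what makes the gameability concern above worth taking seriously.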


Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems

Ghandeharioun, Asma, Shen, Judy Hanwen, Jaques, Natasha, Ferguson, Craig, Jones, Noah, Lapedriza, Agata, Picard, Rosalind

Neural Information Processing Systems

Building an open-domain conversational agent is a challenging problem. Current evaluation methods, mostly post-hoc judgments of static conversations, do not capture conversation quality in a realistic interactive context. In this paper, we investigate interactive human evaluation and provide evidence for its necessity; we then introduce a novel, model-agnostic, and dataset-agnostic method to approximate it. In particular, we propose a self-play scenario where the dialog system talks to itself, and we calculate a combination of proxies such as sentiment and semantic coherence on the conversation trajectory. We show that this metric is capable of capturing the human-rated quality of a dialog model better than any automated metric known to date, achieving a significant Pearson correlation (r > .7).
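The abstract's headline number is a Pearson correlation between the self-play metric and human ratings. A minimal sketch of that check, using made-up illustrative scores rather than the paper's data:

```python
# Hedged sketch: Pearson correlation between an automated metric computed
# per model and the mean human quality rating of that model. The score
# lists below are invented for illustration only.
from statistics import mean

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

metric_scores = [0.2, 0.4, 0.5, 0.7, 0.9]  # hypothetical self-play metric per model
human_ratings = [1.5, 2.0, 2.4, 3.1, 3.8]  # hypothetical mean human ratings
print(round(pearson_r(metric_scores, human_ratings), 3))
```

With real data one would also report the p-value (e.g., via `scipy.stats.pearsonr`), since a high r over few models can still be noisy.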


Challenges in Building Intelligent Open-domain Dialog Systems

Huang, Minlie, Zhu, Xiaoyan, Gao, Jianfeng

arXiv.org Artificial Intelligence

There is a resurgent interest in developing intelligent open-domain dialog systems due to the availability of large amounts of conversational data and the recent progress on neural approaches to conversational AI. Unlike traditional task-oriented bots, an open-domain dialog system aims to establish long-term connections with users by satisfying the human need for communication, affection, and social belonging. This paper reviews the recent works on neural approaches that are devoted to addressing three challenges in developing such systems: semantics, consistency, and interactiveness. Semantics requires a dialog system to not only understand the content of the dialog but also identify the user's social needs during the conversation. Consistency requires the system to demonstrate a consistent personality to win users' trust and gain their long-term confidence. Interactiveness refers to the system's ability to generate interpersonal responses to achieve particular social goals such as entertainment, conforming, and task completion. The works we select to present here are based on our unique views and are by no means complete. Nevertheless, we hope that the discussion will inspire new research in developing more intelligent dialog systems.


RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems

Tao, Chongyang (Peking University) | Mou, Lili (University of Waterloo) | Zhao, Dongyan (Peking University) | Yan, Rui (Peking University)

AAAI Conferences

Open-domain human-computer conversation has been attracting increasing attention over the past few years. However, there does not exist a standard automatic evaluation metric for open-domain dialog systems; researchers usually resort to human annotation for model evaluation, which is time- and labor-intensive. In this paper, we propose RUBER, a Referenced metric and Unreferenced metric Blended Evaluation Routine, which evaluates a reply by taking into consideration both a ground-truth reply and a query (the previous user-issued utterance). Our metric is learnable, but its training does not require labels of human satisfaction. Hence, RUBER is flexible and extensible to different datasets and languages. Experiments on both retrieval and generative dialog systems show that RUBER has a high correlation with human annotation, and that RUBER has fair transferability over different datasets.